Abstract: We consider the synthesis of control policies for
probabilistic systems, modeled by Markov decision processes, 
operating in partially known environments with temporal logic 
specifications. The environment is modeled by a set of Markov 
chains. Each Markov chain describes the behavior of the 
environment in each mode. The mode of the environment, 
however, is not known to the system. Two control objectives 
are considered: maximizing the expected probability and max- 
imizing the worst-case probability that the system satisfies a 
given specification. 

I. Introduction 

In many applications, control systems need to perform 
complex tasks and interact with their (potentially adversarial) 
environments. The correctness of these systems typically 
depends on the behaviors of the environments. For example, 
whether an autonomous vehicle exhibits a correct behavior 
at a pedestrian crossing depends on the behavior of the 
pedestrians, e.g., whether they actually cross the road, remain 
on the same side of the road or step in front of the vehicle 
while it is moving. 

Temporal logics, which were primarily developed by the 
formal methods community for specifying and verifying 
correctness of software and hardware systems, have been 
recently employed to express complex behaviors of control 
systems. Their expressive power extends the class of properties
that can be expressed beyond safety and stability, which are typically
studied in the control and hybrid systems domains. In particular,
[1] shows that the traffic rules enforced in the 2007 DARPA
Urban Challenge can be precisely described using these 
logics. Furthermore, the recent development of language 
equivalence and simulation notions allows abstraction of con- 
tinuous systems to a purely discrete model [2], [3], [4]. This 
subsequently provides a framework for integrating method- 
ologies from the formal methods and the control-theoretic 
communities and enables formal specification, design and 
verification of control systems with complex behaviors. 

Controller synthesis from temporal logic specifications 
has been considered in [5], [6], [7], [8], assuming static 
environments. Synthesis of reactive controllers that take into
account all the possible behaviors of dynamic environments 
can be found in [9], [10]. In this case, the environment 
is treated as an adversary and the synthesis problem can 
be viewed as a two-player game between the system and 
the environment: the environment attempts to falsify the 
specification while the system attempts to satisfy it [11]. In
these works, the system is assumed to be deterministic, i.e., 
an available control action in each state enables exactly one 
transition. Controller synthesis for probabilistic systems such 
as Markov decision processes (MDP) has been considered 
in [12], [13]. However, these works assume that at any time 
instance, the state of the system, including the environment, 
as well as their models are fully known. This may not be a 
valid assumption in many applications. For example, in the 
pedestrian crossing problem previously described, the behavior
of the pedestrians depends on their destination, which is
typically not known to the system. Having to account for all 
the possible behaviors of the pedestrians with respect to all 
their possible destinations may lead to conservative results 
and in many cases, unrealizable specifications. 

Partially observable Markov decision processes (POMDPs)
provide a principled mathematical framework to cope with
partial observability in stochastic domains [14]. Roughly, the 
main idea is to maintain a belief, which is defined as a 
probability distribution over all the possible states. POMDP 
algorithms then operate in the belief space. Unfortunately, it 
has been shown that solving POMDPs exactly is computa- 
tionally intractable [15]. Hence, point-based algorithms have 
been developed to compute an approximate solution based 
on the computation over a representative set of points from 
the belief space rather than the entire belief space [16]. 

In this paper, we take an initial step towards solving 
POMDPs that are subject to temporal logic specifications. 
In particular, we consider the problem where a collection 
of possible environment models is available to the system. 
Different models correspond to different modes of the envi- 
ronment. However, the system does not know in which mode 
the environment is. In addition, the environment may change 
its mode during an execution subject to certain constraints. 
We consider two control objectives: maximizing the expected 
probability and maximizing the worst-case probability that 
the system satisfies a given temporal logic specification. 
The first objective is closely related to solving POMDPs as 
previously described whereas the second objective is closely 
related to solving uncertain MDPs [17]. However, for both 
problems, we aim at maximizing the expected or worst- 
case probability of satisfying a temporal logic specification, 
instead of maximizing the expected or worst-case reward as 
considered in the POMDP and MDP literature. 

The main contribution of this paper is twofold. First, we 
show that the expectation-based synthesis problem can be 
formulated as a control policy synthesis problem for MDPs 
under temporal logic specifications and provide a complete
solution to the problem. Second, we define a mathematical
object called adversarial Markov decision process (AMDP) 
and show that the worst-case-based synthesis problem can be 
formulated as a control policy synthesis problem for AMDP. 
A complete solution to the control policy synthesis for 
AMDP is then provided. Finally, we show that the maximum 
worst-case probability that a given specification is satisfied 
does not depend on whether the controller and the adversary
play alternately or both the control and adversarial policies
are computed at the beginning of an execution.

The rest of the paper is organized as follows: We provide
useful definitions and descriptions of the formalisms in the
following section. Section III is dedicated to the problem
formulation. The expectation-based and the worst-case-based
control policy synthesis are considered in Section IV and
Section V, respectively. Section VI presents an example.
Finally, Section VII concludes the paper and discusses future
work.

II. Preliminaries 

We consider systems that comprise stochastic components. 
In this section, we define the formalisms used in this paper to 
describe such systems and their desired properties. Throughout
the paper, we let $X^*$, $X^\omega$ and $X^+$ denote the sets of
finite, infinite and nonempty finite strings, respectively, over a
set $X$.

A. Automata 

Definition 1: A deterministic Rabin automaton (DRA) is
a tuple $\mathcal{A} = (Q, \Sigma, \delta, q_{init}, Acc)$ where

• $Q$ is a finite set of states,

• $\Sigma$ is a finite set called alphabet,

• $\delta : Q \times \Sigma \to Q$ is a transition function,

• $q_{init} \in Q$ is the initial state, and

• $Acc \subseteq 2^Q \times 2^Q$ is the acceptance condition.

We use the relation notation $q \xrightarrow{w} q'$ to denote $\delta(q, w) = q'$.

Consider an infinite string $\sigma = \sigma_0 \sigma_1 \ldots \in \Sigma^\omega$. A run for $\sigma$
in a DRA $\mathcal{A} = (Q, \Sigma, \delta, q_{init}, Acc)$ is an infinite sequence of
states $q_0 q_1 \ldots$ such that $q_0 = q_{init}$ and $q_i \xrightarrow{\sigma_i} q_{i+1}$ for all
$i \geq 0$. A run is accepting if there exists a pair $(H, K) \in Acc$
such that (1) there exists $n \geq 0$ such that for all $m \geq n$,
$q_m \notin H$, and (2) there exist infinitely many $n \geq 0$ such that
$q_n \in K$.

A string $\sigma \in \Sigma^\omega$ is accepted by $\mathcal{A}$ if there is an accepting
run of $\sigma$ in $\mathcal{A}$. The language accepted by $\mathcal{A}$, denoted by
$\mathcal{L}_\omega(\mathcal{A})$, is the set of all accepted strings of $\mathcal{A}$.
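
The Rabin acceptance condition can be checked mechanically for ultimately periodic words, i.e., words of the form prefix followed by a cycle repeated forever. The following is a minimal sketch under that assumption; the function name `dra_accepts_lasso` and the dictionary encoding of $\delta$ are illustrative choices and are not part of the paper.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Sequence, Tuple

State = Hashable
Letter = Hashable
RabinPair = Tuple[FrozenSet[State], FrozenSet[State]]  # a pair (H, K) in Acc


def dra_accepts_lasso(
    delta: Dict[Tuple[State, Letter], State],
    q_init: State,
    acc: Iterable[RabinPair],
    prefix: Sequence[Letter],
    cycle: Sequence[Letter],
) -> bool:
    """Decide whether the DRA accepts the infinite word prefix . cycle^omega."""
    if not cycle:
        raise ValueError("cycle must be nonempty")
    q = q_init
    for letter in prefix:
        q = delta[(q, letter)]
    # Read the cycle repeatedly until the state at a cycle boundary repeats.
    seen_at_boundary = {}  # boundary state -> index of the pass that starts there
    passes = []            # passes[i] = states visited while reading the i-th pass
    while q not in seen_at_boundary:
        seen_at_boundary[q] = len(passes)
        visited = []
        for letter in cycle:
            q = delta[(q, letter)]
            visited.append(q)
        passes.append(visited)
    # From the first repeated boundary state on, the passes recur forever,
    # so exactly the states in those passes are visited infinitely often.
    start = seen_at_boundary[q]
    inf_states = {s for block in passes[start:] for s in block}
    # Accept if some pair has H visited finitely often and K infinitely often.
    return any(
        inf_states.isdisjoint(H) and not inf_states.isdisjoint(K)
        for H, K in acc
    )
```

For a two-state automaton that tracks whether the last letter was `a`, with the single pair $(\emptyset, \{q_1\})$, the sketch accepts exactly the lasso words that contain `a` in the cycle.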

B. Linear Temporal Logic 

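Linear temporal logic (LTL) is a branch of logic that can
be used to reason about a time line. An LTL formula is built
up from a set $\Pi$ of atomic propositions, the logic connectives
$\neg$, $\vee$, $\wedge$ and $\Rightarrow$ and the temporal modal operators $\bigcirc$
("next"), $\square$ ("always"), $\Diamond$ ("eventually") and $\mathcal{U}$ ("until").
An LTL formula over a set $\Pi$ of atomic propositions is
inductively defined as

$$\varphi := True \mid p \mid \neg\varphi \mid \varphi \wedge \varphi \mid \bigcirc\varphi \mid \varphi \,\mathcal{U}\, \varphi$$

where $p \in \Pi$. Other operators can be defined as follows:
$\varphi \vee \psi = \neg(\neg\varphi \wedge \neg\psi)$, $\varphi \Rightarrow \psi = \neg\varphi \vee \psi$, $\Diamond\varphi = True \,\mathcal{U}\, \varphi$,
and $\square\varphi = \neg\Diamond\neg\varphi$.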

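Semantics of LTL: LTL formulas are interpreted on infinite
strings over $2^\Pi$. Let $\sigma = \sigma_0\sigma_1\sigma_2 \ldots$ where $\sigma_i \in 2^\Pi$ for
all $i \geq 0$. The satisfaction relation $\models$ is defined inductively
on LTL formulas as follows:

• $\sigma \models True$,

• for an atomic proposition $p \in \Pi$, $\sigma \models p$ if and only if
$p \in \sigma_0$,

• $\sigma \models \neg\varphi$ if and only if $\sigma \not\models \varphi$,

• $\sigma \models \varphi_1 \wedge \varphi_2$ if and only if $\sigma \models \varphi_1$ and $\sigma \models \varphi_2$,

• $\sigma \models \bigcirc\varphi$ if and only if $\sigma_1\sigma_2 \ldots \models \varphi$, and

• $\sigma \models \varphi_1 \,\mathcal{U}\, \varphi_2$ if and only if there exists $j \geq 0$ such
that $\sigma_j\sigma_{j+1} \ldots \models \varphi_2$ and for all $i$ such that $0 \leq i < j$,
$\sigma_i\sigma_{i+1} \ldots \models \varphi_1$.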

Given propositional formulas $p_1$ and $p_2$, examples of
widely used LTL formulas include a safety formula $\square p_1$
(read as "always $p_1$"), which simply asserts that property
$p_1$ remains invariantly true throughout an execution and a
reachability formula $\Diamond p_1$ (read as "eventually $p_1$"), which
states that property $p_1$ becomes true at least once in an
execution (i.e., there exists a reachable state that satisfies
$p_1$). In the example presented later in this paper, we use a
formula $p_1 \,\mathcal{U}\, p_2$ (read as "$p_1$ until $p_2$"), which asserts that
$p_1$ has to remain true until $p_2$ becomes true and there is some
point in an execution where $p_2$ becomes true.
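
The inductive semantics above can be evaluated directly on ultimately periodic words. The sketch below assumes such a word is given as a finite prefix and a nonempty cycle, each a list of sets of atomic propositions, and assumes formulas are encoded as nested tuples; both encodings and the name `eval_ltl` are hypothetical. The "until" case is computed as a least fixed point over the finitely many positions of the lasso.

```python
def eval_ltl(formula, prefix, cycle):
    """Return True iff the word prefix . cycle^omega satisfies the formula.

    Formulas are nested tuples: ("true",), ("ap", p), ("not", f),
    ("and", f, g), ("next", f), ("until", f, g).
    """
    word = list(prefix) + list(cycle)
    n, loop = len(word), len(prefix)
    if not cycle:
        raise ValueError("cycle must be nonempty")

    def succ(i):
        # The position after the last one wraps back to the cycle start.
        return i + 1 if i + 1 < n else loop

    def sat(f):
        # sat(f)[i] is True iff the suffix starting at position i satisfies f.
        op = f[0]
        if op == "true":
            return [True] * n
        if op == "ap":
            return [f[1] in word[i] for i in range(n)]
        if op == "not":
            return [not v for v in sat(f[1])]
        if op == "and":
            a, b = sat(f[1]), sat(f[2])
            return [x and y for x, y in zip(a, b)]
        if op == "next":
            a = sat(f[1])
            return [a[succ(i)] for i in range(n)]
        if op == "until":
            a, b = sat(f[1]), sat(f[2])
            res = list(b)
            # n rounds suffice for the least fixed point on n positions.
            for _ in range(n):
                res = [b[i] or (a[i] and res[succ(i)]) for i in range(n)]
            return res
        raise ValueError(f"unknown operator: {op}")

    return sat(formula)[0]
```

With this encoding, "eventually $p_1$" is written as `("until", ("true",), ("ap", "p1"))`, following the derived-operator definitions.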

It can be shown that for any LTL formula $\varphi$ over $\Pi$, there
exists a DRA $\mathcal{A}$ with alphabet $\Sigma = 2^\Pi$ that accepts all
and only words over $\Pi$ that satisfy $\varphi$, i.e., $\mathcal{L}_\omega(\mathcal{A}) = \{\sigma \in
(2^\Pi)^\omega \mid \sigma \models \varphi\}$. Such $\mathcal{A}$ can be automatically constructed
using existing tools [18]. We refer the reader to [19], [20],
[21] for more details on LTL.

C. Systems and Control Policies 

Definition 2: A (discrete-time) Markov chain (MC) is a
tuple $\mathcal{M} = (S, P, s_{init}, \Pi, L)$ where

• $S$ is a countable set of states,

• $P : S \times S \to [0, 1]$ is the transition probability function
such that for any state $s \in S$, $\sum_{s' \in S} P(s, s') = 1$,

• $s_{init} \in S$ is the initial state,

• $\Pi$ is a set of atomic propositions, and

• $L : S \to 2^\Pi$ is a labeling function.

Definition 3: A Markov decision process (MDP) is a tuple
$\mathcal{M} = (S, Act, P, s_{init}, \Pi, L)$ where $S$, $s_{init}$, $\Pi$ and $L$ are
defined as in MC and

• $Act$ is a finite set of actions, and

• $P : S \times Act \times S \to [0, 1]$ is the transition probability
function such that for any state $s \in S$ and action $\alpha \in
Act$, $\sum_{s' \in S} P(s, \alpha, s') \in \{0, 1\}$.

An action $\alpha$ is enabled in state $s$ if and only if
$\sum_{s' \in S} P(s, \alpha, s') = 1$. Let $Act(s)$ denote the set of enabled
actions in $s$.
Given a complete system as the composition of all its 
components, we are interested in computing a control policy 
for the system that optimizes certain objectives. We define a 
control policy for a system modeled by an MDP as follows. 



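Definition 4: Let $\mathcal{M} = (S, Act, P, s_{init}, \Pi, L)$ be a
Markov decision process. A control policy for $\mathcal{M}$ is a
function $\mathcal{C} : S^+ \to Act$ such that $\mathcal{C}(s_0 s_1 \ldots s_n) \in Act(s_n)$
for all $s_0 s_1 \ldots s_n \in S^+$.

Let $\mathcal{M} = (S, Act, P, s_{init}, \Pi, L)$ be an MDP and $\mathcal{C} :
S^+ \to Act$ be a control policy for $\mathcal{M}$. An infinite sequence
$r^{\mathcal{C}}_{\mathcal{M}} = s_0 s_1 \ldots$ on $\mathcal{M}$ generated under policy $\mathcal{C}$ is called a
path on $\mathcal{M}$ if $s_0 = s_{init}$ and $P(s_i, \mathcal{C}(s_0 s_1 \ldots s_i), s_{i+1}) > 0$
for all $i$. The subsequence $s_0 s_1 \ldots s_n$ where $n \geq 0$ is the
prefix of length $n$ of $r^{\mathcal{C}}_{\mathcal{M}}$. We define $Paths^{\mathcal{C}}_{\mathcal{M}}$ and $FPaths^{\mathcal{C}}_{\mathcal{M}}$
as the set of all infinite paths of $\mathcal{M}$ under policy $\mathcal{C}$ and their
finite prefixes, respectively. For $s_0 s_1 \ldots s_n \in FPaths^{\mathcal{C}}_{\mathcal{M}}$,
we let $Paths^{\mathcal{C}}_{\mathcal{M}}(s_0 s_1 \ldots s_n)$ denote the set of all paths in
$Paths^{\mathcal{C}}_{\mathcal{M}}$ with prefix $s_0 s_1 \ldots s_n$.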

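The $\sigma$-algebra associated with $\mathcal{M}$ under policy $\mathcal{C}$ is defined
as the smallest $\sigma$-algebra that contains $Paths^{\mathcal{C}}_{\mathcal{M}}(\hat{r}^{\mathcal{C}}_{\mathcal{M}})$ where
$\hat{r}^{\mathcal{C}}_{\mathcal{M}}$ ranges over all finite paths in $FPaths^{\mathcal{C}}_{\mathcal{M}}$. It follows
that there exists a unique probability measure $\mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}$ on the
$\sigma$-algebra associated with $\mathcal{M}$ under policy $\mathcal{C}$ where for any
$s_0 s_1 \ldots s_n \in FPaths^{\mathcal{C}}_{\mathcal{M}}$,

$$\mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}\{Paths^{\mathcal{C}}_{\mathcal{M}}(s_0 s_1 \ldots s_n)\} = \prod_{0 \leq i < n} P(s_i, \mathcal{C}(s_0 s_1 \ldots s_i), s_{i+1}). \tag{1}$$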
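
Equation (1) is a plain product over the transitions of the prefix, with each action chosen by the policy from the history so far. A minimal sketch follows, assuming a sparse table `P[(s, a, t)]` and a policy given as a Python callable on history tuples; both are illustrative encodings rather than anything prescribed by the paper.

```python
def prefix_probability(P, policy, path):
    """Probability of the set of paths that extend the finite prefix `path`.

    P maps (state, action, next_state) to a probability (missing keys are 0).
    policy maps a history tuple (s_0, ..., s_i) to the action taken at s_i.
    """
    prob = 1.0
    for i in range(len(path) - 1):
        action = policy(tuple(path[: i + 1]))
        prob *= P.get((path[i], action, path[i + 1]), 0.0)
    return prob
```

A prefix consisting of the initial state alone has probability 1, since the product in (1) is then empty.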



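Given an LTL formula $\varphi$, one can show that the set
$\{s_0 s_1 \ldots \in Paths^{\mathcal{C}}_{\mathcal{M}} \mid L(s_0)L(s_1) \ldots \models \varphi\}$ is measurable
[19]. The probability for $\mathcal{M}$ to satisfy $\varphi$ under policy $\mathcal{C}$ is
then defined as

$$\mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}(\varphi) = \mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}\{s_0 s_1 \ldots \in Paths^{\mathcal{C}}_{\mathcal{M}} \mid L(s_0)L(s_1) \ldots \models \varphi\}.$$

For a given (possibly noninitial) state $s \in S$, we let $\mathcal{M}^s =
(S, Act, P, s, \Pi, L)$, i.e., $\mathcal{M}^s$ is the same as $\mathcal{M}$ except that
its initial state is $s$. We define $\mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}(s \models \varphi) = \mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}^s}(\varphi)$ as
the probability for $\mathcal{M}$ to satisfy $\varphi$ under policy $\mathcal{C}$, starting
from $s$.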

A control policy essentially resolves all the nondeterministic
choices in an MDP and induces a Markov chain $\mathcal{M}_{\mathcal{C}}$ that
formalizes the behavior of $\mathcal{M}$ under control policy $\mathcal{C}$ [19]. In
general, $\mathcal{M}_{\mathcal{C}}$ contains all the states in $S^+$ and hence may not
be finite even though $\mathcal{M}$ is finite. However, for a special case
where $\mathcal{C}$ is a memoryless or a finite memory control policy,
it can be shown that $\mathcal{M}_{\mathcal{C}}$ can be identified with a finite
MC. Roughly, a memoryless control policy always picks the
action based only on the current state of $\mathcal{M}$, regardless of
the path that led to that state. In contrast, a finite memory
control policy also maintains its "mode" and picks the action
based on its current mode and the current state of $\mathcal{M}$.

III. Problem Formulation 

Consider a system that comprises 2 components: the plant 
and the environment. The system can regulate the state of the 
plant but has no control over the state of the environment. 
We assume that at any time instance, the state of the plant 
and the environment can be precisely observed. 

The plant is modeled by a finite MDP $\mathcal{M}^{pl} =
(S^{pl}, Act, P^{pl}, s^{pl}_{init}, \Pi^{pl}, L^{pl})$. We assume that for each
$s \in S^{pl}$, there is an action $\alpha \in Act$ that is enabled
in state $s$. In addition, we assume that the environment
can be modeled by some MC in $\mathbb{M}^{env} =
\{\mathcal{M}^{env}_1, \mathcal{M}^{env}_2, \ldots, \mathcal{M}^{env}_N\}$ where for each $i \in \{1, \ldots, N\}$,
$\mathcal{M}^{env}_i = (S^{env}_i, P^{env}_i, s^{env}_{init,i}, \Pi^{env}_i, L^{env}_i)$ is a finite MC
that represents a possible model of the environment. For
simplicity of presentation, we assume that for all
$i \in \{1, \ldots, N\}$, $S^{env}_i = S^{env}$, $\Pi^{env}_i = \Pi^{env}$, $s^{env}_{init,i} = s^{env}_{init}$
and $L^{env}_i = L^{env}$; hence, $\mathcal{M}^{env}_1, \ldots, \mathcal{M}^{env}_N$ differ
only in the transition probability function. These different
environment models can be considered as different modes
of the environment. For the rest of the paper, we use
"environment model" and "environment mode" interchangeably
to refer to some $\mathcal{M}^{env}_i \in \mathbb{M}^{env}$.

We further assume that the plant and the environment
make a transition simultaneously, i.e., both of them make a
transition at every time step. All $\mathcal{M}^{env}_i \in \mathbb{M}^{env}$ are available
to the system. However, the system does not know exactly
which $\mathcal{M}^{env}_i \in \mathbb{M}^{env}$ is the actual model of the environment.
Instead, it maintains the belief $B : \mathbb{M}^{env} \to [0, 1]$,
which is defined as a probability distribution over all possible
environment models such that $\sum_{1 \leq i \leq N} B(\mathcal{M}^{env}_i) = 1$.
$B(\mathcal{M}^{env}_i)$ returns the probability that $\mathcal{M}^{env}_i$ is the model
being executed by the environment. The set of all the beliefs
forms the belief space, which we denote by $\mathbb{B}$. In order
to obtain the belief at each time step, the system is given
the initial belief $B_{init} : \mathbb{M}^{env} \to [0, 1]$, $B_{init} \in \mathbb{B}$. Then, it
subsequently updates the belief using a given belief update
function $\tau : \mathbb{B} \times S^{env} \times S^{env} \to \mathbb{B}$ such that $\tau(B, s, s')$
returns the belief after the environment makes a transition
from state $s$ with belief $B$ to state $s'$. The belief update
function can be defined based on the observation function as
in the belief MDP construction for POMDPs [14].
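
The paper takes the belief update function $\tau$ as given. One common choice, shown here purely as an assumption, is the Bayesian posterior over modes after observing an environment transition from $s$ to $s'$, with a fixed but unknown mode. The name `bayes_belief_update` and the list-of-dictionaries encoding are hypothetical, and projecting the result onto a finite set of representative beliefs is not shown.

```python
def bayes_belief_update(belief, env_models, s, s_next):
    """One possible tau(B, s, s'): Bayes' rule over environment modes.

    belief[i] is B(M_i); env_models[i] maps (s, s') to P_i(s, s').
    Assumes the mode stays fixed across the observed transition.
    """
    weights = [b * P.get((s, s_next), 0.0) for b, P in zip(belief, env_models)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observed transition has zero probability under the belief")
    return tuple(w / total for w in weights)
```

In the pedestrian setting, observing a step toward the road would shift the belief toward the "crossing" model, since that model assigns the step a higher probability.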

In general, the belief space B may be infinite, rendering the 
control policy synthesis computationally intractable to solve 
exactly. To overcome this difficulty, we employ techniques 
for solving POMDPs and approximate B by a finite set of 
representative points from B and work with this approximate 
representation instead. Belief space approximation is beyond 
the scope of this paper and is subject to future work. 
Sampling techniques that have been proposed in the POMDP 
literature can be found in, e.g., [16], [22], [23]. 

Given a system model described by $\mathcal{M}^{pl}$, $\mathbb{M}^{env} =
\{\mathcal{M}^{env}_1, \mathcal{M}^{env}_2, \ldots, \mathcal{M}^{env}_N\}$, the (finite) belief space $\mathbb{B}$, the
initial belief $B_{init}$, the belief update function $\tau$ and an LTL
formula $\varphi$ that describes the desired property of the system,
we consider the following control policy synthesis problems.

Problem 1: Synthesize a control policy for the system that
maximizes the expected probability that the system satisfies
$\varphi$ where the expected probability that the environment
transitions from state $s \in S^{env}$ with belief $B \in \mathbb{B}$ to state
$s' \in S^{env}$ is given by $\sum_{1 \leq i \leq N} B(\mathcal{M}^{env}_i) P^{env}_i(s, s')$.

Problem 2: Synthesize a control policy for the system that
maximizes the worst-case (among all the possible sequences
of environment modes) probability that the system satisfies
$\varphi$. The environment mode may change during an execution:
when the environment is in state $s \in S^{env}$ with belief $B \in \mathbb{B}$,
it may switch to any mode $\mathcal{M}^{env}_i \in \mathbb{M}^{env}$ with $B(\mathcal{M}^{env}_i) >
0$. We consider both the case where the controller and the
environment play a sequential game and the case where the
control policy and the sequence of environment modes are
computed before an execution.

Example 1: Consider a problem where an autonomous
vehicle needs to navigate a road with a pedestrian walking
on the pavement. The vehicle and the pedestrian are
considered the plant and the environment, respectively.
The pedestrian may or may not cross the road, depending
on his/her destination, which is unknown to the system.
Suppose the road is discretized into a finite number
of cells $c_0, c_1, \ldots, c_M$. The vehicle is modeled by an
MDP $\mathcal{M}^{pl} = (S^{pl}, Act, P^{pl}, s^{pl}_{init}, \Pi^{pl}, L^{pl})$ whose state
$s \in S^{pl}$ describes the cell occupied by the vehicle and
whose action $\alpha \in Act$ corresponds to a motion primitive
of the vehicle (e.g., cruise, accelerate, decelerate).
The motion of the pedestrian is modeled by an MC
$\mathcal{M}^{env} \in \mathbb{M}^{env}$ where $\mathbb{M}^{env} = \{\mathcal{M}^{env}_1, \mathcal{M}^{env}_2\}$. $\mathcal{M}^{env}_1 =
(S^{env}, P^{env}_1, s^{env}_{init}, \Pi^{env}, L^{env})$ represents the model of the
pedestrian if s/he decides not to cross the road whereas
$\mathcal{M}^{env}_2 = (S^{env}, P^{env}_2, s^{env}_{init}, \Pi^{env}, L^{env})$ represents the
model of the pedestrian if s/he decides to cross the road. A
state $s \in S^{env}$ describes the cell occupied by the pedestrian.
The labeling functions $L^{pl}$ and $L^{env}$ essentially map each
cell to its label, with an index that distinguishes the vehicle from
the pedestrian, i.e., $L^{pl}(c_j) = c^{pl}_j$ and $L^{env}(c_j) = c^{env}_j$ for all
$j \in \{0, \ldots, M\}$. Consider the desired property stating that the
vehicle does not collide with the pedestrian until it reaches
cell $c_M$ (e.g., the end of the road). In this case, the specification
$\varphi$ can be written as $\varphi = \big(\neg \bigvee_{j \geq 0} (c^{pl}_j \wedge c^{env}_j)\big) \,\mathcal{U}\, c^{pl}_M$.

IV. Expectation-Based Control Policy Synthesis 

To solve Problem 1, we first construct the MDP that
represents the complete system, taking into account the
uncertainties, captured by the belief, in the environment
model. Then, we employ existing results in probabilistic
verification and construct the product MDP and extract its
optimal control policy. In this section, we describe these steps
in more detail and discuss their connection to Problem 1.

A. Construction of the Complete System 

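Based on the notion of belief, we construct the complete
environment model, represented by the MC $\mathcal{M}^{env} =
(S^{env} \times \mathbb{B}, P^{env}, (s^{env}_{init}, B_{init}), \Pi^{env}, L^{env\prime})$ where for each
$s, s' \in S^{env}$ and $B, B' \in \mathbb{B}$,

$$P^{env}((s, B), (s', B')) = \begin{cases} \sum_{1 \leq i \leq N} B(\mathcal{M}^{env}_i) P^{env}_i(s, s') & \text{if } \tau(B, s, s') = B' \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

and $L^{env\prime}(s, B) = L^{env}(s)$.

It is straightforward to check that for all $(s, B) \in S^{env} \times \mathbb{B}$,
$\sum_{s', B'} P^{env}((s, B), (s', B')) = 1$. Hence, $\mathcal{M}^{env}$ is a valid
MC.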
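
A direct transcription of the transition function in (2) is sketched below. The belief is a tuple of mode probabilities, each mode is a sparse dictionary, and `tau` is passed in as a callable; these encodings and the function name are assumptions made for illustration.

```python
def complete_env_transition(belief, env_models, tau, s, s_next, belief_next):
    """Transition probability of the complete environment model, as in (2).

    Returns sum_i B(M_i) * P_i(s, s') when tau(B, s, s') equals B', else 0.
    belief and belief_next are tuples; env_models[i] maps (s, s') to P_i(s, s').
    """
    if tau(belief, s, s_next) != belief_next:
        return 0.0
    return sum(b * P.get((s, s_next), 0.0) for b, P in zip(belief, env_models))
```

Because $\tau$ is a function, each successor state $s'$ is paired with exactly one successor belief, which is why the rows of $P^{env}$ sum to 1.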

Assuming that the plant and the environment make a
transition simultaneously, we obtain the complete system
by constructing the synchronous parallel composition of the
plant and the environment. Synchronous parallel composition
of MDP and MC is defined as follows.

Definition 5: Let $\mathcal{M}_1 = (S_1, Act, P_1, s_{init,1}, \Pi_1, L_1)$
be a Markov decision process and $\mathcal{M}_2 =
(S_2, P_2, s_{init,2}, \Pi_2, L_2)$ be a Markov chain. Their
synchronous parallel composition, denoted by $\mathcal{M}_1 \| \mathcal{M}_2$,
is the MDP $\mathcal{M} = (S_1 \times S_2, Act, P, (s_{init,1}, s_{init,2}), \Pi_1 \cup
\Pi_2, L)$ where:

• For each $s_1, s_1' \in S_1$, $s_2, s_2' \in S_2$ and $\alpha \in Act$,
$P((s_1, s_2), \alpha, (s_1', s_2')) = P_1(s_1, \alpha, s_1') P_2(s_2, s_2')$.

• For each $s_1 \in S_1$ and $s_2 \in S_2$, $L((s_1, s_2)) = L_1(s_1) \cup
L_2(s_2)$.
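
Definition 5 translates into a short construction over sparse transition tables. The sketch below stores only nonzero entries; the dictionary layouts and function names are illustrative assumptions.

```python
def compose(mdp_P, mc_P):
    """Synchronous parallel composition of an MDP and an MC.

    mdp_P maps (s1, a, t1) to P1(s1, a, t1); mc_P maps (s2, t2) to P2(s2, t2).
    Returns a dict mapping ((s1, s2), a, (t1, t2)) to P1 * P2.
    """
    return {
        ((s1, s2), a, (t1, t2)): p1 * p2
        for (s1, a, t1), p1 in mdp_P.items()
        for (s2, t2), p2 in mc_P.items()
    }


def compose_label(L1, L2, s1, s2):
    """Label of the composed state: union of the component labels (as sets)."""
    return set(L1[s1]) | set(L2[s2])
```

Since both components move at every step, an action enabled in the plant state remains enabled in every composed state that contains it.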

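From the above definitions, our complete system can be
modeled by the MDP $\mathcal{M}^{pl} \| \mathcal{M}^{env}$. We denote this MDP by
$\mathcal{M} = (S, Act, P, s_{init}, \Pi, L)$. Note that a state $s \in S$ is of
the form $s = (s^{pl}, s^{env}, B)$ where $s^{pl} \in S^{pl}$, $s^{env} \in S^{env}$
and $B \in \mathbb{B}$. The following lemma shows that Problem 1 can
be solved by finding a control policy $\mathcal{C}$ for $\mathcal{M}$ that maximizes
$\mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}(\varphi)$.

Lemma 1: Let $r^{\mathcal{C}}_{\mathcal{M}} = s_0 s_1 \ldots s_n$ be a finite path of $\mathcal{M}$
under policy $\mathcal{C}$ where for each $i$, $s_i = (s^{pl}_i, s^{env}_i, B_i) \in
S^{pl} \times S^{env} \times \mathbb{B}$. Then,

$$\mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}\{Paths^{\mathcal{C}}_{\mathcal{M}}(r^{\mathcal{C}}_{\mathcal{M}})\} = \prod_{0 \leq j < n} \Big( P^{pl}(s^{pl}_j, \mathcal{C}(s_0 s_1 \ldots s_j), s^{pl}_{j+1}) \sum_{1 \leq i \leq N} B_j(\mathcal{M}^{env}_i) P^{env}_i(s^{env}_j, s^{env}_{j+1}) \Big). \tag{3}$$

Hence, given an LTL formula $\varphi$, $\mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}}(\varphi)$ gives the expected
probability that the system satisfies $\varphi$ under policy $\mathcal{C}$.

Proof: The proof straightforwardly follows from the
definition of $\mathcal{M}^{env}$ and $\mathcal{M}$. ■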

B. Construction of the Product MDP 

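Let $\mathcal{A}_\varphi = (Q, 2^\Pi, \delta, q_{init}, Acc)$ be a DRA that recognizes
the specification $\varphi$. Our next step is to obtain a finite MDP
$\mathcal{M}_p = (S_p, Act_p, P_p, s_{p,init}, \Pi_p, L_p)$ as the product of $\mathcal{M}$
and $\mathcal{A}_\varphi$, defined as follows.

Definition 6: Let $\mathcal{M} = (S, Act, P, s_{init}, \Pi, L)$ be an
MDP and let $\mathcal{A} = (Q, 2^\Pi, \delta, q_{init}, Acc)$ be a DRA. Then,
the product of $\mathcal{M}$ and $\mathcal{A}$ is the MDP $\mathcal{M}_p = \mathcal{M} \otimes \mathcal{A}$ defined
by $\mathcal{M}_p = (S_p, Act, P_p, s_{p,init}, \Pi_p, L_p)$ where $S_p = S \times Q$,
$s_{p,init} = (s_{init}, \delta(q_{init}, L(s_{init})))$, $\Pi_p = Q$, $L_p((s, q)) = \{q\}$ and

$$P_p((s, q), \alpha, (s', q')) = \begin{cases} P(s, \alpha, s') & \text{if } q' = \delta(q, L(s')) \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$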
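
The product in Definition 6 can be built by a forward search from the initial product state. Note one deviation from the definition, made here for compactness: the sketch below constructs only the part of $S \times Q$ that is reachable from $s_{p,init}$. The encodings (labels as frozensets, `delta` keyed by state and label) and the function name are assumptions.

```python
def product_mdp(P, label, s_init, delta, q_init):
    """Reachable part of the product of an MDP with a DRA.

    P maps (s, a, t) to a probability; label maps a state to a frozenset of
    atomic propositions; delta maps (q, frozenset) to the next DRA state.
    Returns (initial product state, reachable product states, product table).
    """
    init = (s_init, delta[(q_init, label[s_init])])
    P_p = {}
    seen = {init}
    frontier = [init]
    while frontier:
        s, q = frontier.pop()
        for (u, a, t), p in P.items():
            if u != s or p == 0.0:
                continue
            # The DRA component moves deterministically on the label of t.
            nxt = (t, delta[(q, label[t])])
            P_p[((s, q), a, nxt)] = p
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return init, seen, P_p
```

Because the DRA is deterministic, each MDP transition yields exactly one product transition, so row sums are preserved.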
Consider a path $r^{\mathcal{C}_p}_{\mathcal{M}_p} = (s_0, q_0)(s_1, q_1) \ldots$ of $\mathcal{M}_p$ under
some control policy $\mathcal{C}_p$. We say that $r^{\mathcal{C}_p}_{\mathcal{M}_p}$ is accepting if
and only if there exists a pair $(H, K) \in Acc$ such that the
word generated by $r^{\mathcal{C}_p}_{\mathcal{M}_p}$ intersects with $H$ finitely many times
and intersects with $K$ infinitely many times, i.e., (1) there
exists $n \geq 0$ such that for all $m \geq n$, $L_p((s_m, q_m)) \cap H =
\emptyset$, and (2) there exist infinitely many $n \geq 0$ such that
$L_p((s_n, q_n)) \cap K \neq \emptyset$.



Stepping through the above definition shows that given
a path $r^{\mathcal{C}_p}_{\mathcal{M}_p} = (s_0, q_0)(s_1, q_1) \ldots$ of $\mathcal{M}_p$ generated under
some control policy $\mathcal{C}_p$, the corresponding path $s_0 s_1 \ldots$ on
$\mathcal{M}$ generates a word $L(s_0)L(s_1) \ldots$ that satisfies $\varphi$ if and
only if $r^{\mathcal{C}_p}_{\mathcal{M}_p}$ is accepting. Therefore, each accepting path
of $\mathcal{M}_p$ uniquely corresponds to a path of $\mathcal{M}$ whose word
satisfies $\varphi$. In addition, a control policy $\mathcal{C}_p$ on $\mathcal{M}_p$ induces
a corresponding control policy $\mathcal{C}$ on $\mathcal{M}$. The details for
generating $\mathcal{C}$ from $\mathcal{C}_p$ can be found, e.g., in [19], [12].

C. Control Policy Synthesis for Product MDP 

From probabilistic verification, it has been shown that
the maximum probability for $\mathcal{M}$ to satisfy $\varphi$ is equivalent
to the maximum probability of reaching a certain set of
states of $\mathcal{M}_p$ known as accepting maximal end components
(AMECs). An end component of the product MDP $\mathcal{M}_p =
(S_p, Act_p, P_p, s_{p,init}, \Pi_p, L_p)$ is a pair $(T, A)$ where
$\emptyset \neq T \subseteq S_p$ and $A : T \to 2^{Act_p}$ such that (1) $\emptyset \neq A(s) \subseteq
Act_p(s)$ for all $s \in T$, (2) the directed graph induced by
$(T, A)$ is strongly connected, and (3) for all $s \in T$ and
$\alpha \in A(s)$, $\{t \in S_p \mid P_p(s, \alpha, t) > 0\} \subseteq T$. An accepting
maximal end component of $\mathcal{M}_p$ is an end component $(T, A)$
such that for some $(H, K) \in Acc$, $H \cap T = \emptyset$ and $K \cap T \neq \emptyset$
and there is no end component $(T', A') \neq (T, A)$ such that
$T \subseteq T'$ and $A(s) \subseteq A'(s)$ for all $s \in T$. It has an important
property that starting from any state in $T$, there exists a finite
memory control policy to keep the state within $T$ forever
while visiting all states in $T$ infinitely often with probability
1. AMECs of $\mathcal{M}_p$ can be efficiently identified based on
iterative computations of strongly connected components of
$\mathcal{M}_p$. We refer the reader to [19] for more details.

Once the AMECs of $\mathcal{M}_p$ are identified, we then compute
the maximum probability of reaching $S_G$ where $S_G$ contains
all the states in the AMECs of $\mathcal{M}_p$. For the rest of the paper,
we use an LTL-like notation to describe events in MDPs.
In particular, we use $\Diamond S_G$ to denote the event of reaching
some state in $S_G$ eventually.

For each $s \in S_p$, let $x_s$ denote the maximum probability
of reaching a state in $S_G$, starting from $s$. Formally, $x_s =
\sup_{\mathcal{C}} \mathrm{Pr}^{\mathcal{C}}_{\mathcal{M}_p}(s \models \Diamond S_G)$. There are two main techniques
for computing the probability $x_s$ for each $s \in S_p$: linear
programming (LP) and value iteration. LP-based techniques
yield an exact solution but typically do not scale as
well as value iteration. On the other hand, value iteration
is an iterative numerical technique. This method works by
successively computing the probability vector $(x^{(k)}_s)_{s \in S_p}$ for
increasing $k \geq 0$ such that $\lim_{k \to \infty} x^{(k)}_s = x_s$ for all $s \in S_p$.
Initially, we set $x^{(0)}_s = 1$ if $s \in S_G$ and $x^{(0)}_s = 0$ otherwise.
In the $(k + 1)$th iteration where $k \geq 0$, we set

$$x^{(k+1)}_s = \begin{cases} 1 & \text{if } s \in S_G \\ \max_{\alpha \in Act_p(s)} \sum_{t \in S_p} P_p(s, \alpha, t)\, x^{(k)}_t & \text{otherwise.} \end{cases} \tag{5}$$

In practice, we terminate the computation and say
that $x^{(k)}_s$ converges when a termination criterion such as
$\max_{s \in S_p} |x^{(k+1)}_s - x^{(k)}_s| < \epsilon$ is satisfied for some fixed
(typically very small) threshold $\epsilon$.
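
The iteration in (5), together with the termination criterion, can be sketched as follows. The data layout (`P[(s, a)]` as a dictionary of successors, `enabled[s]` as the set $Act_p(s)$) and the default parameter values are assumptions for this example only.

```python
def max_reach_probability(states, enabled, P, goal, eps=1e-10, max_iter=100000):
    """Value iteration for the maximum probability of reaching `goal`.

    enabled[s] is the collection of actions enabled in s; P[(s, a)] maps each
    successor t to P_p(s, a, t). Stops when the largest change is below eps.
    """
    x = {s: 1.0 if s in goal else 0.0 for s in states}
    for _ in range(max_iter):
        x_new = {}
        for s in states:
            if s in goal:
                x_new[s] = 1.0
            else:
                # Best one-step lookahead over the enabled actions, as in (5).
                x_new[s] = max(
                    (sum(p * x[t] for t, p in P[(s, a)].items()) for a in enabled[s]),
                    default=0.0,
                )
        if max(abs(x_new[s] - x[s]) for s in states) < eps:
            return x_new
        x = x_new
    return x
```

The returned values approach $x_s$ from below, so a small threshold is needed when a goal is reached only in the limit, for instance through a self-loop that is retried indefinitely.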

Once the vector $(x_s)_{s \in S_p}$ is computed, a finite memory
control policy $\mathcal{C}_p$ for $\mathcal{M}_p$ that maximizes the probability
for $\mathcal{M}$ to satisfy $\varphi$ can be constructed as follows. First,
consider the case when $\mathcal{M}_p$ is in state $s \in S_G$. In this case, $s$
belongs to some AMEC $(T, A)$ and the policy $\mathcal{C}_p$ selects an
action $\alpha \in A(s)$ such that all actions in $A(s)$ are scheduled
infinitely often. (For example, $\mathcal{C}_p$ may select the action for $s$
according to a round-robin policy.) Next, consider the case
when $\mathcal{M}_p$ is in state $s \in S_p \setminus S_G$. In this case, $\mathcal{C}_p$ picks an
action to ensure that $\mathrm{Pr}^{\mathcal{C}_p}_{\mathcal{M}_p}(s \models \Diamond S_G) = x_s$ can be achieved.
If $x_s = 0$, an action in $Act_p(s)$ can be chosen arbitrarily.
Otherwise, $\mathcal{C}_p$ picks an action $\alpha \in Act^{max}_p(s)$ such that
$P_p(s, \alpha, t) > 0$ for some $t \in S_p$ with $\|t\| = \|s\| - 1$. Here,
$Act^{max}_p(s) \subseteq Act_p(s)$ is the set of actions such that for all
$\alpha \in Act^{max}_p(s)$, $x_s = \sum_{t \in S_p} P_p(s, \alpha, t)\, x_t$ and $\|s\|$ denotes
the length of a shortest path from $s$ to a state in $S_G$, using
only actions in $Act^{max}_p$.

V. Worst-Case-Based Control Policy Synthesis 

To solve Problem 2, we first propose a mathematical object
called adversarial Markov decision process (AMDP). Then,
we show that Problem 2 can be formulated as finding an
optimal control policy for an AMDP. Finally, control policy
synthesis for AMDP is discussed.

A. Adversarial Markov Decision Process 

Definition 7: An adversarial Markov decision process
(AMDP) is a tuple $\mathcal{M}^{\mathcal{A}} = (S, Act_C, Act_A, P, s_{init}, \Pi, L)$
where $S$, $s_{init}$, $\Pi$ and $L$ are defined as in MDP and

• $Act_C$ is a finite set of control actions,

• $Act_A$ is a finite set of adversarial actions, and

• $P : S \times Act_C \times Act_A \times S \to [0, 1]$ is the transition
probability function such that for any $s \in S$, $\alpha \in Act_C$
and $\beta \in Act_A$, $\sum_{t \in S} P(s, \alpha, \beta, t) \in \{0, 1\}$.

We say that a control action $\alpha$ is enabled in state $s$ if
and only if there exists an adversarial action $\beta$ such that
$\sum_{t \in S} P(s, \alpha, \beta, t) = 1$. Similarly, an adversarial action $\beta$
is enabled in state $s$ if and only if there exists a control
action $\alpha$ such that $\sum_{t \in S} P(s, \alpha, \beta, t) = 1$. Let $Act_C(s)$ and
$Act_A(s)$ denote the set of enabled control and adversarial
actions, respectively, in $s$. We assume that for all $s \in S$,
$\alpha \in Act_C(s)$ and $\beta \in Act_A(s)$, $\sum_{t \in S} P(s, \alpha, \beta, t) = 1$,
i.e., whether an adversarial (resp. control) action is enabled
in state $s$ depends only on the state $s$ itself but not on a
control (resp. adversarial) action taken by the system (resp.
adversary).

Given an AMDP $\mathcal{M}^{\mathcal{A}} = (S, Act_C, Act_A, P, s_{init}, \Pi, L)$,
a control policy $\mathcal{C} : S^+ \to Act_C$ and an adversarial policy
$\mathcal{D} : S^+ \to Act_A$ for an AMDP can be defined such that
$\mathcal{C}(s_0 s_1 \ldots s_n) \in Act_C(s_n)$ and $\mathcal{D}(s_0 s_1 \ldots s_n) \in Act_A(s_n)$
for all $s_0 s_1 \ldots s_n \in S^+$. A unique probability measure $\mathrm{Pr}^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^{\mathcal{A}}}$
on the $\sigma$-algebra associated with $\mathcal{M}^{\mathcal{A}}$ under control policy
$\mathcal{C}$ and adversarial policy $\mathcal{D}$ can then be defined based on the
notion of path on $\mathcal{M}^{\mathcal{A}}$ as for an ordinary MDP.



We end the section with important properties of AMDP 
that will be employed in the control policy synthesis. 

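Definition 8: Let $\mathcal{G}$ be a set of functions from $T$ to $V$
where $T$ and $V$ are finite sets. We say that $\mathcal{G}$ is complete if
for any $t_1, t_2 \in T$ and $g_1, g_2 \in \mathcal{G}$, there exists $g \in \mathcal{G}$ such
that $g(t_1) = g_1(t_1)$ and $g(t_2) = g_2(t_2)$.

Lemma 2: Let $T$, $U$ and $V$ be finite sets and let $\mathcal{G}$ be a
set of functions from $T$ to $V$. Then, $\mathcal{G}$ is finite. Furthermore,
suppose that $\mathcal{G}$ is complete. Then, for any $F : U \times T \to \mathbb{R}_{\geq 0}$
and $G : V \times T \to \mathbb{R}_{\geq 0}$,

$$\begin{aligned} \min_{u \in U} \sum_{t \in T} F(u, t) \max_{g \in \mathcal{G}} G(g(t), t) &= \min_{u \in U} \max_{g \in \mathcal{G}} \sum_{t \in T} F(u, t)\, G(g(t), t) \\ &= \max_{g \in \mathcal{G}} \min_{u \in U} \sum_{t \in T} F(u, t)\, G(g(t), t). \end{aligned} \tag{6}$$

In addition,

$$\min_{u \in U} \sum_{t \in T} F(u, t) \min_{g \in \mathcal{G}} G(g(t), t) = \min_{u \in U} \min_{g \in \mathcal{G}} \sum_{t \in T} F(u, t)\, G(g(t), t). \tag{7}$$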
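
The three expressions in (6) can be compared numerically on a small instance by brute force. The sketch below takes the function set to be all functions from $T$ to $V$, which is one example of a complete set in the sense of Definition 8; the function name and the dictionary encodings of $F$ and $G$ are assumptions.

```python
from itertools import product


def lemma2_values(T, U, V, F, G):
    """Evaluate the three expressions of (6) by enumeration.

    F maps (u, t) and G maps (v, t) to non-negative numbers. The function set
    is all maps T -> V, so max over g of G(g(t), t) equals max over v of G(v, t).
    """
    funcs = [dict(zip(T, vals)) for vals in product(V, repeat=len(T))]

    def total(u, g):
        return sum(F[(u, t)] * G[(g[t], t)] for t in T)

    first = min(sum(F[(u, t)] * max(G[(v, t)] for v in V) for t in T) for u in U)
    second = min(max(total(u, g) for g in funcs) for u in U)
    third = max(min(total(u, g) for u in U) for g in funcs)
    return first, second, third
```

The equality of the second and third values is the min-max interchange that the lemma establishes; it relies on the pointwise maximizer being available in the function set and is not a general property of min and max.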

Proof: Since both $T$ and $V$ are finite, clearly, $\mathcal{G}$ is
finite. Thus, the min and max in (6)-(7) are well defined.
Let $\tilde{g} \in \mathcal{G}$ be a function such that $\sum_{t \in T} F(u, t)\, G(\tilde{g}(t), t) =
\max_{g \in \mathcal{G}} \sum_{t \in T} F(u, t)\, G(g(t), t)$. Since $\mathcal{G}$ is complete, there
exists a function $g^* \in \mathcal{G}$ such that for all $t \in T$, $g^*(t)$
satisfies $G(g^*(t), t) = \max_{g \in \mathcal{G}} G(g(t), t)$. Furthermore,
since both $F$ and $G$ are non-negative and $\tilde{g} \in \mathcal{G}$, it follows
that $F(u, t)\, G(g^*(t), t) = \max_{g \in \mathcal{G}} F(u, t)\, G(g(t), t) \geq
F(u, t)\, G(\tilde{g}(t), t)$ for all $u \in U$ and $t \in T$. Taking the sum
over all $t \in T$, we get

$$\sum_{t \in T} F(u, t)\, G(g^*(t), t) \geq \sum_{t \in T} F(u, t)\, G(\tilde{g}(t), t) = \max_{g \in \mathcal{G}} \sum_{t \in T} F(u, t)\, G(g(t), t). \tag{8}$$

But since $g^* \in \mathcal{G}$, it follows that

$$\max_{g \in \mathcal{G}} \sum_{t \in T} F(u, t)\, G(g(t), t) \geq \sum_{t \in T} F(u, t)\, G(g^*(t), t). \tag{9}$$

Combining (8) and (9), we get that all the inequalities must
be replaced by equalities. Hence, we can conclude that the
first equality in (6) holds. The proof for the equality in (7)
follows similar arguments. Hence, we only provide a proof
for the second equality in (6).

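First, from weak duality, we know that

$$\min_{u \in U} \max_{g \in \mathcal{G}} \sum_{t \in T} F(u, t)\, G(g(t), t) \geq \max_{g \in \mathcal{G}} \min_{u \in U} \sum_{t \in T} F(u, t)\, G(g(t), t). \tag{10}$$

For each $g \in \mathcal{G}$, consider an element
$u^*_g \in U$ such that $\sum_{t \in T} F(u^*_g, t)\, G(g(t), t) =
\min_{u \in U} \sum_{t \in T} F(u, t)\, G(g(t), t)$. Since $g^* \in \mathcal{G}$, it follows
that

$$\max_{g \in \mathcal{G}} \min_{u \in U} \sum_{t \in T} F(u, t)\, G(g(t), t) \geq \sum_{t \in T} F(u^*_{g^*}, t)\, G(g^*(t), t). \tag{11}$$

But, from the definition of $g^*$ and $u^*_{g^*}$ and the first equality
in (6), we also get

$$\begin{aligned} \sum_{t \in T} F(u^*_{g^*}, t)\, G(g^*(t), t) &= \min_{u \in U} \sum_{t \in T} F(u, t) \max_{g \in \mathcal{G}} G(g(t), t) \\ &= \min_{u \in U} \max_{g \in \mathcal{G}} \sum_{t \in T} F(u, t)\, G(g(t), t). \end{aligned} \tag{12}$$

Following the chain of inequalities in (10)-(12), we can
conclude that the second equality in (6) holds. ■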

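Proposition 1: Let $\mathcal{M}^{\mathcal{A}} = (S, Act_C, Act_A, P, s_{init}, \Pi,
L)$ be a finite AMDP and $S_G \subseteq S$ be the set of goal states.
Let

$$x_s = \sup_{\mathcal{C}_0 \in C_0} \inf_{\mathcal{D}_0 \in D_0} \sup_{\mathcal{C}_1 \in C_1} \inf_{\mathcal{D}_1 \in D_1} \cdots\ \mathrm{Pr}^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^{\mathcal{A}}}(s \models \Diamond S_G), \tag{13}$$

where for any $n \geq 0$, $C_n = \{\mathcal{C} : S^{n+1} \to
Act_C \mid \mathcal{C}(s_0 s_1 \ldots s_n) \in Act_C(s_n)\}$, $D_n = \{\mathcal{D} : S^{n+1} \to
Act_A \mid \mathcal{D}(s_0 s_1 \ldots s_n) \in Act_A(s_n)\}$, $\mathcal{C}(s_0 s_1 \ldots s_n) =
\mathcal{C}_n(s_0 s_1 \ldots s_n)$ and $\mathcal{D}(s_0 s_1 \ldots s_n) = \mathcal{D}_n(s_0 s_1 \ldots s_n)$. For
each $k \geq 0$, consider a vector $(x^{(k)}_s)_{s \in S}$ where $x^{(k)}_s = 1$ for
all $s \in S_G$, $x^{(0)}_s = 0$ for all $s \notin S_G$ and for all $k \geq 0$,

$$x^{(k+1)}_s = \begin{cases} 1 & \text{if } s \in S_G \\ \max_{\alpha \in Act_C(s)} \min_{\beta \in Act_A(s)} \sum_{t \in S} P(s, \alpha, \beta, t)\, x^{(k)}_t & \text{otherwise.} \end{cases} \tag{14}$$

Then, for any $s \in S$, $x^{(0)}_s \leq x^{(1)}_s \leq \ldots \leq 1$, and $x_s =
\lim_{k \to \infty} x^{(k)}_s$.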



Proof: Since $\sum_{t \in S} P(s,\alpha,\beta,t) \in \{0,1\}$ for all $\alpha \in Act_C$
and $\beta \in Act_A$, it can be checked that for any $k \geq 0$,
if $x_s^{(k)} \in [0,1]$ for all $s \in S$, then $x_s^{(k+1)} \in [0,1]$ for all
$s \in S$. Since $x_s^{(0)} \in [0,1]$ for all $s \in S$, we can conclude
that $x_s^{(k)} \in [0,1]$ for all $k \geq 0$ and $s \in S$.

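Let $\Diamond^{\leq k} S_G$ denote the event of reaching some state in $S_G$
within $k$ steps. We will show, using induction on $k$, that for
any $k \geq 0$ and $s \in S$,

$$x_s^{(k)} = \sup_{\mathcal{C}_0 \in \mathbb{C}_0} \inf_{\mathcal{D}_0 \in \mathbb{D}_0} \sup_{\mathcal{C}_1 \in \mathbb{C}_1} \inf_{\mathcal{D}_1 \in \mathbb{D}_1} \cdots \sup_{\mathcal{C}_{k-1} \in \mathbb{C}_{k-1}} \inf_{\mathcal{D}_{k-1} \in \mathbb{D}_{k-1}} \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(s \models \Diamond^{\leq k} S_G), \tag{15}$$

where $\mathcal{C}$ and $\mathcal{D}$ are control and adversarial policies such
that for any $n$ such that $0 \leq n < k$, $\mathcal{C}(s_0 s_1 \ldots s_n) = \mathcal{C}_n(s_0 s_1 \ldots s_n)$
and $\mathcal{D}(s_0 s_1 \ldots s_n) = \mathcal{D}_n(s_0 s_1 \ldots s_n)$. The
case where $s \in S_G$ is trivial so we only consider an arbitrary
$s \in S \setminus S_G$. Clearly, $x_s^{(0)} = \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(s \models \Diamond^{\leq 0} S_G)$ for
any control and adversarial policies $\mathcal{C}$ and $\mathcal{D}$. Consider an
arbitrary $k \geq 0$ and assume that for all $s \in S$, (15) holds.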



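Then, from (14), we get that for any $s \in S \setminus S_G$,

$$x_s^{(k+1)} = \max_{\alpha \in Act_C(s)} \min_{\beta \in Act_A(s)} \sum_{t \in S} P(s,\alpha,\beta,t) \sup_{\mathcal{C}_0 \in \mathbb{C}_0} \inf_{\mathcal{D}_0 \in \mathbb{D}_0} \sup_{\mathcal{C}_1 \in \mathbb{C}_1} \inf_{\mathcal{D}_1 \in \mathbb{D}_1} \cdots \sup_{\mathcal{C}_{k-1} \in \mathbb{C}_{k-1}} \inf_{\mathcal{D}_{k-1} \in \mathbb{D}_{k-1}} \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(t \models \Diamond^{\leq k} S_G).$$

Since for all $n$ such that $0 \leq n < k$, $S^{n+1}$, $Act_C$ and $Act_A$
are finite, it follows that $\mathbb{C}_n$ and $\mathbb{D}_n$ are finite. Furthermore,
$\Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(t \models \Diamond^{\leq k} S_G)$ only depends on $\mathcal{C}_0, \mathcal{D}_0, \ldots, \mathcal{C}_{k-1}, \mathcal{D}_{k-1}$.
Thus, we can conclude that all the sup and inf above
can be attained, so we can replace them by max and min,
respectively.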

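Consider arbitrary $s \in S \setminus S_G$ and $\alpha \in Act_C(s)$.
Define a function $F : Act_A(s) \times S \to [0,1]$ such that
$F(\beta, t) = P(s, \alpha, \beta, t)$. In addition, define a set $\mathcal{G} = \mathbb{C}_0$ and
a function $G : Act_C \times S \to [0,1]$ such that

$$G(\mathcal{C}_0(t), t) = \inf_{\mathcal{D}_0 \in \mathbb{D}_0} \sup_{\mathcal{C}_1 \in \mathbb{C}_1} \inf_{\mathcal{D}_1 \in \mathbb{D}_1} \cdots \sup_{\mathcal{C}_{k-1} \in \mathbb{C}_{k-1}} \inf_{\mathcal{D}_{k-1} \in \mathbb{D}_{k-1}} \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(t \models \Diamond^{\leq k} S_G),$$

where for any $n \in \{0, \ldots, k-1\}$,
$\mathcal{C}(s_0 s_1 \ldots s_n) = \mathcal{C}_n(s_0 s_1 \ldots s_n)$ and $\mathcal{D}(s_0 s_1 \ldots s_n) = \mathcal{D}_n(s_0 s_1 \ldots s_n)$.
Pick arbitrary $t_1, t_2 \in S$ and $g_1, g_2 \in \mathcal{G}$.
Suppose $g_1(t_1) = \alpha_1$ and $g_2(t_2) = \alpha_2$. Then, from the
definition of $\mathbb{C}_0$, $\alpha_1 \in Act_C(t_1)$ and $\alpha_2 \in Act_C(t_2)$;
hence, there must exist $g \in \mathcal{G}$ such that $g(t_1) = \alpha_1$ and
$g(t_2) = \alpha_2$. Thus, by definition, $\mathcal{G}$ is complete. Applying
Lemma 2, we get

$$x_s^{(k+1)} = \max_{\alpha \in Act_C(s)} \min_{\beta \in Act_A(s)} \max_{\mathcal{C}_0 \in \mathbb{C}_0} \sum_{t \in S} P(s,\alpha,\beta,t) \min_{\mathcal{D}_0 \in \mathbb{D}_0} \cdots \max_{\mathcal{C}_{k-1} \in \mathbb{C}_{k-1}} \min_{\mathcal{D}_{k-1} \in \mathbb{D}_{k-1}} \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(t \models \Diamond^{\leq k} S_G).$$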

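Applying a similar procedure as in the previous paragraph
$2k$ times, we get

$$\begin{aligned} x_s^{(k+1)} &= \max_{\alpha \in Act_C(s)} \min_{\beta \in Act_A(s)} \max_{\mathcal{C}_0 \in \mathbb{C}_0} \min_{\mathcal{D}_0 \in \mathbb{D}_0} \cdots \max_{\mathcal{C}_{k-1} \in \mathbb{C}_{k-1}} \min_{\mathcal{D}_{k-1} \in \mathbb{D}_{k-1}} \sum_{t \in S} P(s,\alpha,\beta,t)\, \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(t \models \Diamond^{\leq k} S_G) \\ &= \sup_{\mathcal{C}_0 \in \mathbb{C}_0} \inf_{\mathcal{D}_0 \in \mathbb{D}_0} \sup_{\mathcal{C}_1 \in \mathbb{C}_1} \inf_{\mathcal{D}_1 \in \mathbb{D}_1} \cdots \sup_{\mathcal{C}_k \in \mathbb{C}_k} \inf_{\mathcal{D}_k \in \mathbb{D}_k} \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(s \models \Diamond^{\leq k+1} S_G). \end{aligned}$$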

Thus, we can conclude that for any $k \geq 0$ and $s \in S \setminus S_G$,
(15) holds. Furthermore, since the set of events $\Diamond^{\leq k+1} S_G$
includes the set of events $\Diamond^{\leq k} S_G$, we obtain $x_s^{(k)} \leq x_s^{(k+1)}$.
Finally, we can conclude that $\lim_{k \to \infty} x_s^{(k)} = x_s$ using
the fact that the sequence $x_s^{(0)}, x_s^{(1)}, \ldots$ is monotonic and
bounded (and hence has a finite limit) and $\Diamond S_G$ is the
countable union of the events $\Diamond^{\leq k} S_G$. ■
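
The recursion in (14) translates directly into a value iteration. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `worst_case_reachability`, the dictionary encoding of $P$, the stopping tolerance `tol` and the three-state example are all hypothetical. The proposition only states that the iterates converge to $x_s$, so stopping once successive vectors differ by less than `tol` is a practical approximation.

```python
def worst_case_reachability(states, goal, act_c, act_a, P,
                            tol=1e-12, max_iter=10_000):
    """Iterate recursion (14) until successive vectors differ by < tol.

    P maps (s, alpha, beta) to a dict {t: probability}; act_c[s] and
    act_a[s] list the control and adversarial actions enabled in s.
    """
    x = {s: 1.0 if s in goal else 0.0 for s in states}
    for _ in range(max_iter):
        nxt = {}
        for s in states:
            if s in goal:
                nxt[s] = 1.0
                continue
            # The controller maximizes, the adversary minimizes.
            nxt[s] = max(
                min(
                    sum(p * x[t] for t, p in P[s, a, b].items())
                    for b in act_a[s]
                )
                for a in act_c[s]
            )
        if max(abs(nxt[s] - x[s]) for s in states) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical three-state AMDP: "start", "goal" and an absorbing "sink".
states = ["start", "goal", "sink"]
act_c = {"start": ["safe", "risky"], "sink": ["stay"]}
act_a = {"start": ["b1", "b2"], "sink": ["b1"]}
P = {
    ("start", "safe", "b1"): {"goal": 0.8, "sink": 0.2},
    ("start", "safe", "b2"): {"goal": 0.6, "sink": 0.4},
    ("start", "risky", "b1"): {"goal": 1.0},
    ("start", "risky", "b2"): {"goal": 0.3, "sink": 0.7},
    ("sink", "stay", "b1"): {"sink": 1.0},
}
x = worst_case_reachability(states, {"goal"}, act_c, act_a, P)
print(x["start"])
```

In this toy instance the action "risky" reaches the goal with probability 1 under `b1` but only 0.3 under `b2`, so the worst-case-optimal choice is "safe", whose guaranteed value is 0.6.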

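Proposition 2: Let $\mathcal{M}^A = (S, Act_C, Act_A, P, s_{init}, \Pi, L)$
be a finite AMDP and $S_G \subseteq S$ be the set of goal states.
Let

$$y_s = \sup_{\mathcal{C}} \inf_{\mathcal{D}} \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(s \models \Diamond S_G). \tag{16}$$

For each $k \geq 0$, consider a vector $(y_s^{(k)})_{s \in S}$ where $y_s^{(0)} = 1$
for all $s \in S_G$, $y_s^{(0)} = 0$ for all $s \notin S_G$ and for all $k \geq 0$,

$$y_s^{(k+1)} = \begin{cases} 1 & \text{if } s \in S_G, \\ \max_{\alpha \in Act_C(s)} \min_{\beta \in Act_A(s)} \sum_{t \in S} P(s,\alpha,\beta,t)\, y_t^{(k)} & \text{otherwise.} \end{cases} \tag{17}$$

Then, for any $s \in S$, $y_s^{(0)} \leq y_s^{(1)} \leq \cdots \leq y_s$ and $y_s = \lim_{k \to \infty} y_s^{(k)}$.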



Proof: The proof closely follows the proof of Proposition 1.
Roughly, we show, by induction on $k$ and applying
Lemma 2, that for any $k \geq 0$ and $s \in S$, $y_s^{(k)} = \sup_{\mathcal{C}} \inf_{\mathcal{D}} \Pr\nolimits^{\mathcal{C},\mathcal{D}}_{\mathcal{M}^A}(s \models \Diamond^{\leq k} S_G)$.
We can conclude the proof
using a similar argument as in the proof of Proposition 1. ■

From Proposition 1 and Proposition 2, we can conclude
that the sequential game (13) is equivalent to its nonsequential
counterpart (16).

Corollary 1: The maximum worst-case probability of
reaching a set $S_G$ of states in an AMDP $\mathcal{M}^A$ does not
depend on whether the controller and the adversary play
alternately or both the control and adversarial policies are
computed at the beginning of an execution.



B. The Complete System as an AMDP 

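We start by constructing an MDP $\mathcal{M} = (S, Act, P, s_{init}, \Pi, L)$
that represents the complete
system as described in Section IV-A. As discussed
earlier, a state of $\mathcal{M}$ is of the form $(s^{pl}, s^{env}, B)$
where $s^{pl} \in S^{pl}$, $s^{env} \in S^{env}$ and $B \in \mathcal{B}$. The
corresponding AMDP $\mathcal{M}^A$ of $\mathcal{M}$ is then defined
as $\mathcal{M}^A = (S, Act_C, Act_A, P^A, s_{init}, \Pi, L)$ where
$Act_C = Act$, $Act_A = \{\beta_1, \ldots, \beta_N\}$ (i.e., $\beta_i$ corresponds
to the environment choosing model $\mathcal{M}_i^{env}$) and for any
$s_1^{pl}, s_2^{pl} \in S^{pl}$, $s_1^{env}, s_2^{env} \in S^{env}$, $B_1, B_2 \in \mathcal{B}$, $\alpha \in Act_C$
and $1 \leq i \leq N$,

$$P^A\big((s_1^{pl}, s_1^{env}, B_1), \alpha, \beta_i, (s_2^{pl}, s_2^{env}, B_2)\big) = \begin{cases} P^{pl}(s_1^{pl}, \alpha, s_2^{pl})\, P_i^{env}(s_1^{env}, s_2^{env}) & \text{if } B_1(\mathcal{M}_i^{env}) > 0, \\ 0 & \text{otherwise} \end{cases} \tag{18}$$

if $\tau(B_1, s_1^{env}, s_2^{env}) = B_2$; otherwise,
$P^A\big((s_1^{pl}, s_1^{env}, B_1), \alpha, \beta_i, (s_2^{pl}, s_2^{env}, B_2)\big) = 0$.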

It is straightforward to check that $\mathcal{M}^A$ is a valid AMDP.
Furthermore, based on this construction and the assumptions
that (1) at any plant state $s^{pl} \in S^{pl}$, there exists an action
that is enabled in $s^{pl}$, and (2) at any point in an execution,
the belief $B \in \mathcal{B}$ satisfies $\sum_{1 \leq i \leq N} B(\mathcal{M}_i^{env}) = 1$, it can
be shown that at any state $s \in S$, there exist a control
action $\alpha \in Act_C$ and an adversarial action $\beta \in Act_A$ that
are enabled in $s$. In addition, consider the case where the
environment is in state $s^{env} \in S^{env}$ with belief $B \in \mathcal{B}$. It
can be shown that for all $\mathcal{M}_i^{env} \in \mathbf{M}^{env}$, if $B(\mathcal{M}_i^{env}) > 0$,
then $\beta_i$ is enabled in $(s^{pl}, s^{env}, B)$ for all $s^{pl} \in S^{pl}$. Thus,
we can conclude that $\mathcal{M}^A$ represents the complete system
for Problem 2.
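
The construction of the AMDP transition function can be illustrated with a short sketch. It is written under several assumptions that are not taken from the paper: beliefs are referred to by index into a table `beliefs`, the plant and environment models are stored as dictionaries `P_pl` and `P_env`, the belief update is passed in as a function `tau`, and all names and the two-mode example are hypothetical.

```python
def amdp_transition(P_pl, P_env, beliefs, tau, s1, alpha, i, s2):
    """Probability of moving from s1 to s2 under control action alpha
    and adversarial action beta_i (the environment is in mode i).

    States are triples (plant state, environment state, belief index).
    """
    (pl1, env1, b1), (pl2, env2, b2) = s1, s2
    # The belief component of the successor must follow the update rule.
    if tau(b1, env1, env2) != b2:
        return 0.0
    # beta_i is only available when mode i has positive belief.
    if beliefs[b1][i] <= 0:
        return 0.0
    # Plant and environment move independently.
    return P_pl.get((pl1, alpha, pl2), 0.0) * P_env[i].get((env1, env2), 0.0)


# Hypothetical example with two environment modes.
P_pl = {("p0", "go", "p1"): 0.9, ("p0", "go", "p0"): 0.1}
P_env = [
    {("e0", "e0"): 1.0},  # mode 0: the environment stays in e0
    {("e0", "e1"): 1.0},  # mode 1: the environment moves to e1
]
beliefs = [(0.0, 1.0), (0.5, 0.5)]  # beliefs[b][i] = belief in mode i


def tau(b, env, env_next):
    # Once e1 is observed, the mode must be 1, i.e. belief index 0.
    return 0 if env_next == "e1" else b


print(amdp_transition(P_pl, P_env, beliefs, tau,
                      ("p0", "e0", 1), "go", 1, ("p1", "e1", 0)))
```

Because the adversary may pick any mode with positive belief, only the support of each belief enters the construction, which is the observation made in Remark 1.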



Remark 1: According to (18), the system does not need to
maintain the exact belief in each state. The only information
needed to construct an AMDP that represents the complete
system is all the possible modes of the environment in each
state of the complete system. This allows us to integrate
methodologies for discrete state estimation [24] to reduce
the size of the AMDP. This direction is subject to future
work.

C. Control Policy Synthesis for AMDP 

Similar to control policy synthesis for MDP, control
policy synthesis for AMDP $\mathcal{M}^A$ can be done
on the basis of a product construction. The product
of $\mathcal{M}^A = (S, Act_C, Act_A, P, s_{init}, \Pi, L)$ and DRA
$\mathcal{A} = (Q, 2^{\Pi}, \delta, q_{init}, Acc)$ is an AMDP $\mathcal{M}_p^A =
(S_p, Act_C, Act_A, P_p, s_{p,init}, \Pi_p, L_p)$, which is defined similarly
to the product of MDP and DRA, except that the set of
actions is partitioned into the set of control and the set of
adversarial actions.

Following the steps for synthesizing a control policy for
product MDP, we identify the AMECs of $\mathcal{M}_p^A$. An AMEC
of $\mathcal{M}_p^A$ is defined based on the notion of end component as
for the case of product MDP. However, an end component of
$\mathcal{M}_p^A$ needs to be defined, taking into account the adversary.
Specifically, an end component of $\mathcal{M}_p^A$ is a pair $(T, A)$ where
$\emptyset \neq T \subseteq S_p$ and $A : T \to 2^{Act_C}$ such that (1) $\emptyset \neq A(s) \subseteq
Act_C(s)$ for all $s \in T$, (2) the directed graph induced by
$(T, A)$ under any adversarial policy is strongly connected,
and (3) for all $s \in T$, $\alpha \in A(s)$ and $\beta \in Act_A(s)$, $\{t \in
S_p \mid P_p(s,\alpha,\beta,t) > 0\} \subseteq T$.

Fig. 1. The road and its partition used in the autonomous vehicle example.

Using a similar argument as in the case of product MDP
[19], it can be shown that the maximum worst-case probability
for $\mathcal{M}^A$ to satisfy $\varphi$ is equivalent to the maximum worst-case
probability of reaching a state in an AMEC of $\mathcal{M}_p^A$. We
can then apply Proposition 1 and Proposition 2 to compute
$x_s$, which is equivalent to $y_s$, using value iteration. A control
policy for $\mathcal{M}_p^A$ that maximizes the worst-case probability for
$\mathcal{M}^A$ to satisfy $\varphi$ can be constructed as outlined at the end
of Section IV-C for product MDP.

VI. Example 

Consider, once again, the autonomous vehicle problem
described in Example 1. Suppose the road is discretized into
9 cells $c_0, \ldots, c_8$ as shown in Figure 1. The vehicle starts
in cell $c_0$ and has to reach cell $c_8$ whereas the pedestrian
starts in cell $c_1$. The models of the vehicle and the pedestrian
are shown in Figure 2. The vehicle has two actions $\alpha_1$
and $\alpha_2$, which correspond to decelerating and accelerating,
respectively. The pedestrian has 2 modes $\mathcal{M}_1^{env}$ and $\mathcal{M}_2^{env}$,
which correspond to the cases where s/he wants to remain
on the left side of the road and cross the road, respectively.
A DRA $\mathcal{A}_\varphi$ that accepts all and only words that satisfy
$\varphi = \big(\neg \bigvee_{j \geq 0} (c_j^{pl} \wedge c_j^{env})\big) \; \mathcal{U} \; c_8^{pl}$ is shown in Figure 3. Finally,
we consider the set $\mathcal{B} = \{B_0, \ldots, B_8\}$ of beliefs where for
all $i$, $B_i(\mathcal{M}_1^{env}) = 0.1i$ and $B_i(\mathcal{M}_2^{env}) = 1 - 0.1i$. We set
$B_{init} = B_5$ where it is equally likely that the pedestrian is
in mode $\mathcal{M}_1^{env}$ or mode $\mathcal{M}_2^{env}$. The belief update function $\tau$
is defined such that the longer the pedestrian stays on the left
side of the road, the higher the probability that s/he is in mode $\mathcal{M}_1^{env}$.
Once the pedestrian starts crossing the road, we
change the belief to $B_0$ where it is certain that the pedestrian
is in mode $\mathcal{M}_2^{env}$. Specifically, for all $i$, we let

$$\tau(B_i, s, s') = \begin{cases} B_0 & \text{if } s' \in \{c_2, c_4, c_6, c_7\}, \\ B_i & \text{if } i = 0 \text{ or } i = 8, \\ B_{i+1} & \text{otherwise.} \end{cases} \tag{19}$$
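
As an illustration, the belief set and the update rule (19) can be written out as follows. This is a sketch of one reading of (19): it assumes that the cases are checked in the order listed and that the "otherwise" case moves from $B_i$ to $B_{i+1}$, which matches the statement that the belief in $\mathcal{M}_1^{env}$ grows while the pedestrian stays on the left side. Cells and beliefs are represented by their indices, and the names `CROSSING`, `belief` and `tau` are hypothetical.

```python
CROSSING = {2, 4, 6, 7}  # indices of the cells c2, c4, c6, c7


def belief(i):
    """Return (B_i(M1_env), B_i(M2_env)) for i in 0..8."""
    return (i / 10, 1 - i / 10)


def tau(i, s, s_next):
    """Index of the updated belief when the pedestrian moves s -> s_next."""
    if s_next in CROSSING:
        return 0      # the pedestrian is crossing: certainly mode 2
    if i in (0, 8):
        return i      # the extreme beliefs are kept
    return i + 1      # one more step on the left side


# Starting from B_5, two steps on the left side and then a crossing step.
i = 5
for s, s_next in [(1, 3), (3, 5), (5, 6)]:
    i = tau(i, s, s_next)
    print(i, belief(i))
```

The current cell `s` is not used by this particular update rule; it is kept only so that the signature matches $\tau(B_i, s, s')$.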



Both the expectation-based and the worst-case-based control
policy synthesis as described in Sections IV and V are
implemented in MATLAB. The computation was performed
on a MacBook Pro with a 2.8 GHz Intel Core 2 Duo
processor.




Fig. 2. Vehicle and pedestrian models: (a) vehicle model $\mathcal{M}^{pl}$; (b) pedestrian model $\mathcal{M}_1^{env}$; (c) pedestrian model $\mathcal{M}_2^{env}$.

Fig. 3. A DRA $\mathcal{A}_\varphi$ that recognizes the prefixes of $\varphi = \neg col \; \mathcal{U} \; c_8^{pl}$,
where the collision event $col$ is defined as $col = \bigvee_{j \geq 0} (c_j^{pl} \wedge c_j^{env})$. The
acceptance condition is $Acc = \{(\emptyset, \{q_1\})\}$.



First, we consider the expectation-based control policy
synthesis (i.e., Problem 1). As outlined in Section IV, we
first construct the MDP that represents the complete system.
After removing all the unreachable states, the resulting MDP
contains 65 states and the product MDP contains 53 states.
The computation time is summarized in Table I. Note that
the computation, especially the product MDP construction,
can be sped up significantly if a more efficient representation
of DRA is used. The maximum expected probability for the
system to satisfy $\varphi$ is 0.9454. Examination of the resulting
control policy shows that this maximum expected probability
of satisfying $\varphi$ can be achieved by applying action $\alpha_1$,
i.e., decelerating, until the pedestrian crosses the street or
the vehicle is not behind the pedestrian in the longitudinal
direction, i.e., when the vehicle is in cell $c_i$ and the pedestrian
is in cell $c_j$ where $j < i$. (Based on the expectation, the
probability that the pedestrian eventually crosses the road is
1 according to the probability measure defined in (1).) If we
include the belief $B$ where $B(\mathcal{M}_1^{env}) = 1$ and $B(\mathcal{M}_2^{env}) = 0$,
then the expectation-based optimal control policy is such
that the vehicle applies $\alpha_1$ until either the pedestrian crosses
the road, the vehicle is not behind the pedestrian in the
longitudinal direction or the belief is updated to $B$, at which
point it applies $\alpha_2$. Once the vehicle reaches the destination
$c_8$, it applies $\alpha_1$ forever.

Next, we consider the worst-case-based control policy
synthesis (i.e., Problem 2). In this case, the resulting AMDP
contains 49 states and the product AMDP contains 53 states
after removing all the unreachable states. The maximum
worst-case probability for the system to satisfy $\varphi$ is 0.9033.
The resulting control policy is slightly more aggressive than
the expectation-based policy. In addition to the cases where
the expectation-based controller applies a 2 , the worst-case- 
based controller also applies a 2 when the vehicle is in cq 
and the pedestrian is in c^ and when the vehicle is in c 2 and 
the pedestrian is in C5. 





              MDP/AMDP   Product MDP/AMDP   Prob. vector   Control policy   Total
Expectation   0.05       2.31               0.73           0.08             3.17
Worst-case    0.20       2.73               0.46           0.05             3.44

TABLE I
Time required (in seconds) for each step of computation.

Fig. 4. Expectation-based control policy. At time $0 \leq t < 3$, the vehicle
applies $\alpha_1$; $\alpha_2$ is applied at time $3 \leq t < 7$, after which the vehicle reaches
the goal and applies $\alpha_1$ forever.



Simulation results are shown in Figure 4 and Figure 5.
The smaller (red) rectangle represents the pedestrian whereas
the bigger (blue) rectangle represents the vehicle. The filled
and unfilled rectangles represent their current positions and
the trace of their trajectories, respectively. Notice that the
vehicle successfully reaches its goal without colliding with
the pedestrian, as required by its specification, with the
worst-case-based controller being slightly more aggressive.

Finally, we would like to note that due to the structure of
this example, the worst-case-based synthesis problem can be
solved without having to deal with the belief space at all.
As the environment cannot be in state $c_2$, $c_4$, $c_6$ or $c_7$ when
it is in mode $\mathcal{M}_1^{env}$ and these states only have transitions
among themselves, once the environment transitions to one
of these states, we know for sure that it can only be in mode
$\mathcal{M}_2^{env}$ and cannot change its mode anymore. In states $c_1$,
$c_3$ and $c_5$, the environment can be in either mode. Based on
this structure and Remark 1, we can construct an AMDP that
represents the complete system with a smaller number of states
than $\mathcal{M}^A$ constructed using the method described in Section
V-B. Exploiting the structure of the problem to reduce the
size of AMDP is subject to future work.

Fig. 5. Worst-case-based control policy. At time $0 \leq t < 2$, the vehicle
applies $\alpha_1$. However, the vehicle moves forward during time $0 \leq t < 1$
because of the uncertainties in the vehicle model. $\alpha_2$ is applied at time
$2 \leq t < 5$, after which the vehicle reaches the goal and applies $\alpha_1$ forever.

VII. Conclusions and Future Work

We took an initial step towards solving POMDPs that
are subject to temporal logic specifications. In particular, we
considered the problem where the system interacts with its
dynamic environment. A collection of possible environment
models is available to the system. Different models correspond
to different modes of the environment. However, the
system does not know in which mode the environment is.
In addition, the environment may change its mode during
an execution. Control policy synthesis was considered with
respect to two different objectives: maximizing the expected
probability and maximizing the worst-case probability that
the system satisfies a given temporal logic specification.

Future work includes investigating methodologies to ap- 
proximate the belief space with a finite set of representative 
points. This problem has been considered extensively in 
the POMDP literature. Since the value iteration used to 
obtain a solution to our expectation-based synthesis problem 
is similar to the value iteration used to solve POMDP 
problems where the expected reward is to be maximized, 
we believe that existing sampling techniques used to solve 
POMDP problems can be adapted to solve our problem. 
Another direction of research is to integrate methodologies 
for discrete state estimation to reduce the size of AMDP for 
the worst-case-based synthesis problem.